Abstract

The demand for low-energy video processing systems such as HEVC in the multimedia market has
led to extensive development of fast algorithms for the efficient approximation of 2-D DCT
transforms. The DCT is employed in a multitude of compression standards due to its remarkable
energy compaction properties. Multiplier-free approximate DCT transforms have been proposed that
offer superior compression performance at very low circuit complexity. Such approximations can be
realized in digital VLSI hardware using additions and subtractions only, leading to significant
reductions in chip area and power consumption compared to conventional DCTs and integer
transforms. In this paper, we introduce a novel 8-point DCT approximation that requires only 14
addition operations and no multiplications. The proposed transform possesses low computational
complexity and is compared to state-of-the-art DCT approximations in terms of both algorithm
complexity and peak signal-to-noise ratio. The proposed DCT approximation is a candidate for
reconfigurable video standards such as HEVC. The proposed transform and several other DCT
approximations are mapped to systolic-array digital architectures, physically realized as digital
prototype circuits using FPGA technology, and mapped to 45 nm CMOS technology.

Keywords: Approximate DCT, low-complexity algorithms, image compression, HEVC, low power consumption


1 Introduction 

Recent years have seen a significant demand for high dynamic range systems that operate
at high resolutions [1]. In particular, high-quality digital video in multimedia devices [2] and
video-over-Internet protocol networks [3] are prominent areas where such requirements are evident.
Other notable fields are geospatial remote sensing [1], traffic cameras [5], automatic surveillance [1],
homeland security [6], automotive industry [7], and multimedia wireless sensor networks [8], to
name but a few. Often, hardware capable of significant throughput is necessary, together with an
acceptable area-time complexity [8].

In this context, the discrete cosine transform (DCT) is an essential mathematical tool in
both image and video coding. Indeed, the DCT was demonstrated to provide good energy
compaction for natural images, which can be described by first-order Markov signals.
Moreover, in many situations, the DCT is a very close substitute for the Karhunen-Loève transform
(KLT), which has optimal properties [9, 11, 13, 14, 16]. As a result, the two-dimensional (2-D) version
of the 8-point DCT was adopted in several imaging standards such as JPEG [17], MPEG-1 [18],
MPEG-2 [19], H.261 [20], H.263 [21, 22], and H.264/AVC [23, 24].

Additionally, new compression schemes such as High Efficiency Video Coding (HEVC) employ
DCT-like integer transforms operating at various block sizes ranging from 4x4 to 32x32
pixels [25-27]. The distinctive characteristic of HEVC is its capability of achieving high
compression performance at approximately half the bit rate required by H.264/AVC with the same
image quality [25-27]. HEVC was also demonstrated to be especially effective for high-resolution
video applications. However, HEVC possesses a significant computational complexity in terms of
arithmetic operations [26-28]. In fact, HEVC can be 2-4 times more computationally demanding
than H.264/AVC [26]. Therefore, low-complexity DCT-like approximations may
benefit future video codecs including emerging HEVC/H.265 systems.

Several efficient algorithms for computing the DCT have been developed, and a substantial
literature is available [10, 29-35]. Although fast algorithms can significantly reduce the
computational complexity of computing the DCT, floating-point operations are still required.
Despite their accuracy, floating-point operations are expensive in terms of circuit complexity and
power consumption. Therefore, minimizing the number of floating-point operations is a sought-after
property in a fast algorithm. One way of circumventing this issue is by means of approximate
transforms.

The aim of this paper is two-fold. First, we introduce a new DCT approximation that possesses
an extremely low arithmetic complexity, requiring only 14 additions. This novel transform
was obtained by solving a tailored optimization problem aimed at minimizing the transform
computational cost. Second, we propose hardware implementations for several 2-D 8-point
approximate DCTs. The approximate DCT methods under consideration are (i) the proposed
transform; (ii) the 2008 Bouguezel-Ahmad-Swamy (BAS) DCT approximation [36]; (iii) the parametric
transform for image compression [37]; (iv) the Cintra-Bayer (CB) approximate DCT based on the
rounding-off function [38]; (v) the modified CB approximate DCT [39]; and (vi) the DCT
approximation proposed in [40] in the context of beamforming. All introduced implementations are
designed as fully parallel time-multiplexed 2-D architectures for 8x8 data blocks. Additionally,
the proposed designs are based on successive calls of 1-D architectures, taking advantage of the
separability property of the 2-D DCT kernel. The designs were thoroughly assessed and compared.
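
The separability property mentioned above can be illustrated with a short sketch. The Python snippet below is only an illustration under stated assumptions: the helper names (`transform_1d`, `transform_2d`) are hypothetical, and a small 4-point matrix with ±1 entries stands in for an 8-point transform. It applies a 1-D transform to every column of a block and then to every row, and compares the result with the matrix product T·A·Tᵀ.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def transform_1d(t, x):
    """1-D transform: y = T x."""
    return [sum(c * v for c, v in zip(row, x)) for row in t]


def transform_2d(t, block):
    """2-D transform built from successive calls of the 1-D transform."""
    # Column pass: transform every column, giving T * A.
    column_pass = transpose([transform_1d(t, col) for col in transpose(block)])
    # Row pass: transform every row of the intermediate result, giving (T * A) * T^T.
    return [transform_1d(t, row) for row in column_pass]


# Illustrative 4-point matrix with entries in {+1, -1} (an assumption, not from the paper).
T_DEMO = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]
BLOCK = [
    [5, 1, 0, 2],
    [3, 4, 1, 0],
    [2, 2, 6, 1],
    [0, 1, 3, 7],
]

row_column_result = transform_2d(T_DEMO, BLOCK)
matrix_result = matmul(matmul(T_DEMO, BLOCK), transpose(T_DEMO))
```

In a hardware setting the same idea lets one 1-D core be reused for both passes, with a transposition step between them.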

This paper unfolds as follows. In Section 2, we discuss the role of DCT-like fast algorithms
for video CODECs while proposing some new possibilities for low-power video processing where
rapid reconfiguration of the hardware realization is possible. In Section 3, we review selected
approximate methods for DCT computation and describe the associated fast algorithms in terms of
matrix factorizations. Section 4 details the proposed transform and its fast algorithm based on
matrix factorizations. Section 5 discusses the computational complexity of the approximate DCT
techniques. Performance measures are also quantified and evaluated to assess the proposed
approximate DCT as well as the remaining selected approximations. In Section 6, digital hardware
architectures for the discussed algorithms are supplied for both 1-D and 2-D analysis. Hardware
resource consumption figures using field programmable gate array (FPGA) and CMOS 45 nm
application-specific integrated circuit (ASIC) technologies are tabulated. Conclusions and final
remarks are in Section 7.

2 Reconfigurable DCT-like Fast Algorithms in Video CODECs 

In the current literature, several approximate methods for the DCT calculation have been reported.
While not computing the DCT exactly, such approximations can provide meaningful estimations
at low-complexity requirements. In particular, some DCT approximations can totally eliminate
the requirement for floating-point operations: all calculations are performed over a fixed-point
arithmetic framework. Prominent 8-point approximation-based techniques were proposed in
[15, 36, 44]. Works addressing 16-point DCT approximations are also found in the literature [43, 45, 46].

In general, these approximation methods employ a transformation matrix whose elements are
defined over the set {0, ±1/2, ±1, ±2}. This implies null multiplicative complexity, because the
required operations can be implemented exclusively by means of binary additions and shift
operations. Such DCT approximations can provide low-cost and low-power designs and effectively
replace the exact DCT and other DCT-like transforms. Indeed, the performance characteristics of the
low-complexity DCT approximations appear similar to those of the exact DCT, while their associated
hardware implementations are economical because of the absence of multipliers [14, 15, 43, 45, 46].
As a consequence, some prospective applications of DCT approximations are found in real-time video
transmission and processing.
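
To make the "additions and shifts only" claim concrete, the following Python sketch evaluates one row of such a matrix on integer data without any multiplication. It is a minimal illustration: the function names `scale` and `shift_add_dot` are hypothetical, and the handling of ±1/2 by an arithmetic right shift (which discards the least significant bit) is an assumption about how a fixed-point datapath would treat it.

```python
def scale(coefficient, x):
    """Multiply integer x by a coefficient in {0, ±1/2, ±1, ±2} using shifts only."""
    magnitude = abs(coefficient)
    if magnitude == 0:
        return 0
    if magnitude == 1:
        value = x
    elif magnitude == 2:
        value = x << 1      # left shift: exact multiplication by two
    elif magnitude == 0.5:
        value = x >> 1      # right shift: division by two, least significant bit is lost
    else:
        raise ValueError("coefficient outside {0, ±1/2, ±1, ±2}")
    return value if coefficient > 0 else -value


def shift_add_dot(row, x):
    """Inner product of a low-complexity matrix row with integer samples."""
    accumulator = 0
    for coefficient, sample in zip(row, x):
        accumulator += scale(coefficient, sample)
    return accumulator


example = shift_add_dot([1, 2, -1, 0], [3, 5, 7, 9])   # 3 + 10 - 7 + 0
halved_odd = scale(0.5, 7)                             # 7 / 2 truncated by the shift
```

The truncation visible in `halved_odd` is the kind of error a right shift can introduce on odd inputs, whereas entries restricted to {0, ±1, ±2} are exact on integer data.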

Emerging video standards such as HEVC provide for on-the-fly reconfigurable operation, which
makes the availability of an ensemble of fast algorithms and digital VLSI architectures a valuable
asset for low-energy high-performance embedded systems. For certain applications, low circuit
complexity and/or power consumption is the driving factor, while for other applications, the
highest picture quality at reasonably low power consumption and/or complexity may be more
important. In emerging systems, it may be possible to switch the mode of operation based on the
demanded picture quality versus the available energy in the device. Such a feature would be
invaluable in high-quality smart video devices demanding extended battery life. Thus, the
availability of a suite of fast algorithms and implementation libraries for several efficient DCT
approximation algorithms may be a welcome contribution.

For example, in a future HEVC system, it may be possible to reconfigure the DCT engine to
use a higher-complexity DCT approximation that offers better signal-to-noise ratio (SNR) when
the master device is powered by a remote power source, and then have the device seamlessly switch
to a low-complexity fast DCT algorithm when the battery storage falls below a certain threshold.
Alternatively, the CODEC may be reconfigured in real time to switch between
different DCT approximations offering varying picture quality and power consumption depending
on the measured SNR of the incoming video stream, which would be content specific and very
difficult to predict without resorting to real-time video metrics [48].

Furthermore, another possible application for a suite of DCT approximation algorithms in the
light of reconfigurable video codecs is the intelligent intra-frame fast reconfiguration of the DCT
core to take into account certain local frame information and measured SNR metrics. For example,
certain parts of a frame can demand better picture quality (foreground, say) when compared to
a relatively unimportant part of the frame (background, say). In such a case, it may be possible
to switch DCT approximation algorithms on an intra-frame basis to take into account the varying
demands for picture clarity within a frame as well as the availability of reconfigurable-logic-based
digital DCT engines that support fast reconfiguration in real time.

3 Review of Approximate DCT Methods 

In this section, we review the mathematical description of the selected 8-point DCT approximations.
All methods discussed here consist of a transformation matrix that can be put in the following
format:


[diagonal matrix] x [low-complexity matrix].

The diagonal matrix usually contains irrational numbers of the form $1/\sqrt{m}$, where $m$ is a small
positive integer. In principle, the irrational numbers required in the diagonal matrix would
require an increased computational complexity. However, in the context of image compression, the
diagonal matrix can simply be absorbed into the quantization step of JPEG-like compression
procedures [15, 36-39]. Therefore, in this case, the complexity of the approximation is bounded
by the complexity of the low-complexity matrix. Since the entries of the low-complexity matrix
are restricted to {0, ±1/2, ±1, ±2}, that is, zero or signed powers of two, null multiplicative
complexity is achieved.

In the next subsections, we detail these methods in terms of their transformation matrices and the
associated fast algorithms obtained by matrix factorization techniques. All derived fast algorithms
employ sparse matrices whose elements are the above-mentioned powers of two.
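
The absorption of the diagonal matrix into the quantization step can be sketched as follows. This Python example is an illustration under assumptions: the function names, the sample values, and the quantization steps are all hypothetical, and only a 1-D case with three coefficients is shown (in the 2-D case the scale factor of coefficient (i, j) would be the product of the i-th and j-th diagonal entries). It compares scaling followed by quantization with quantization by pre-merged step sizes.

```python
import math


def quantize_scaled(y, diagonal, steps):
    """Apply the diagonal scaling explicitly, then quantize."""
    return [round(d * value / step) for value, d, step in zip(y, diagonal, steps)]


def merged_steps(diagonal, steps):
    """Fold the diagonal entries into the quantization steps (computed once, offline)."""
    return [step / d for d, step in zip(diagonal, steps)]


def quantize_merged(y, steps):
    """Quantize the unscaled low-complexity output with the merged steps."""
    return [round(value / step) for value, step in zip(y, steps)]


# Hypothetical values: outputs of a low-complexity matrix, diagonal entries of the
# form 1/sqrt(m), and quantization step sizes.
y = [40, -12, 7]
diagonal = [1 / math.sqrt(8), 1 / 2, 1 / math.sqrt(2)]
steps = [16.0, 11.0, 10.0]

explicit = quantize_scaled(y, diagonal, steps)
absorbed = quantize_merged(y, merged_steps(diagonal, steps))
```

Because the merged steps are constants, the irrational factors never appear in the run-time datapath.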
3.1 Bouguezel-Ahmad-Swamy Approximate DCT

In [36], a low-complexity approximate DCT was introduced by Bouguezel et al. We refer to this
approximate DCT as the BAS-2008 approximation. The BAS-2008 approximation $C_1$ has the following
mathematical structure:

\[
C_1 = D_1 \cdot T_1,
\]

where $D_1$ is a diagonal matrix and $T_1$ is a low-complexity matrix, both specified in [36]. A
fast algorithm for matrix $T_1$ can be derived by means of matrix factorization. Indeed, $T_1$ can be
written as a product of three sparse matrices having {0, ±1/2, ±1} elements [36]:
$T_1 = A_3 \cdot A_2 \cdot A_1$, where

\[
A_1 = \begin{bmatrix} I_4 & \bar{I}_4 \\ \bar{I}_4 & -I_4 \end{bmatrix}.
\]

Matrices $I_n$ and $\bar{I}_n$ denote the identity and counter-identity matrices of order $n$,
respectively. Matrix $A_1$ is the well-known decimation-in-frequency structure present in several
fast algorithms.


3.2 Parametric Transform 

Proposed in 2011 by Bouguezel-Ahmad-Swamy [37], the parametric transform is an 8-point orthogonal
transform containing a single parameter $a$ in its transformation matrix $C^{(a)}$. In this work, we
refer to this method as the BAS-2011 transform. It is given as follows:

\[
C^{(a)} = D^{(a)} \cdot T^{(a)} = D^{(a)} \cdot
\begin{bmatrix}
1 &  1 &  1 &  1 &  1 &  1 &  1 &  1 \\
1 &  1 &  0 &  0 &  0 &  0 & -1 & -1 \\
1 &  a & -a & -1 & -1 & -a &  a &  1 \\
0 &  0 &  1 &  0 &  0 & -1 &  0 &  0 \\
1 & -1 & -1 &  1 &  1 & -1 & -1 &  1 \\
0 &  0 &  0 &  1 & -1 &  0 &  0 &  0 \\
1 & -1 &  0 &  0 &  0 &  0 &  1 & -1 \\
a & -1 &  1 & -a & -a &  1 & -1 &  a
\end{bmatrix},
\]

where $D^{(a)} = \operatorname{diag}\left(1/\sqrt{8},\, 1/2,\, 1/\sqrt{4+4a^2},\, 1/\sqrt{2},\, 1/\sqrt{8},\, 1/\sqrt{2},\, 1/2,\, 1/\sqrt{4+4a^2}\right)$,
i.e., each diagonal entry is the reciprocal of the Euclidean norm of the corresponding row of
$T^{(a)}$. Usually the parameter $a$ is selected as a small integer in order to minimize the
complexity of $T^{(a)}$. In [37], suggested values are $a \in \{0, 1/2, 1\}$. The value $a = 1/2$
will not be considered in our analyses because in hardware it represents a right-shift, which may
incur computational errors. Another possible value that furnishes a low-complexity, error-free
transform is $a = 2$. The matrix factorization of $T^{(a)}$ that leads to its fast algorithm is
given in [37]. It consists of a permutation matrix $P_1$ and a block-diagonal matrix containing
the $2 \times 2$ block $\begin{bmatrix} a & 1 \\ -1 & a \end{bmatrix}$, applied after the sparse
factors $A_4 \cdot A_1$, where $A_4$ is a block-diagonal matrix with entries in {0, ±1} and $A_1$
is the butterfly matrix of Section 3.1.

Matrix $P_1$ performs the simple permutation (1)(2 5 6 4 8 7)(3), where cyclic notation is
employed. This is a compact notation to denote a permutation. In this particular case, it means
that component indices are permuted according to 2 → 5 → 6 → 4 → 8 → 7 → 2. Indices 1 and 3 are
unchanged. Therefore, $P_1$ adds no computational complexity.
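
The cyclic notation can be unpacked with a few lines of Python. The sketch below is an illustration only: the function names are hypothetical, and the convention that the component at index i is moved to the index that follows i in its cycle is an assumption (the opposite convention would simply reverse the arrows). It shows that the permutation only relabels components and involves no arithmetic.

```python
def permutation_from_cycles(cycles, n):
    """Return a dict mapping each 1-based index to its image under the given cycles."""
    image = {i: i for i in range(1, n + 1)}
    for cycle in cycles:
        for position, index in enumerate(cycle):
            image[index] = cycle[(position + 1) % len(cycle)]
    return image


def apply_permutation(image, x):
    """Move the component at 1-based index i to index image[i]."""
    y = [None] * len(x)
    for i, value in enumerate(x, start=1):
        y[image[i] - 1] = value
    return y


# Permutation (1)(2 5 6 4 8 7)(3) from the text.
P1_IMAGE = permutation_from_cycles([(1,), (2, 5, 6, 4, 8, 7), (3,)], 8)

labels = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
permuted = apply_permutation(P1_IMAGE, labels)
```

In hardware, such a permutation corresponds to a fixed rewiring of signals, which is why it contributes no additions or shifts.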


3.3 CB-2011 Approximation 

By means of judiciously rounding off the elements of the exact DCT matrix, a DCT approximation
was obtained and described in [38]. The resulting 8-point approximation matrix is orthogonal and
contains only elements in {0, ±1}. Clearly, it possesses very low arithmetic complexity [38]. The
derived transformation matrix $C_2$ is given by:

\[
C_2 = D_2 \cdot T_2 = D_2 \cdot
\begin{bmatrix}
1 &  1 &  1 &  1 &  1 &  1 &  1 &  1 \\
1 &  1 &  1 &  0 &  0 & -1 & -1 & -1 \\
1 &  0 &  0 & -1 & -1 &  0 &  0 &  1 \\
1 &  0 & -1 & -1 &  1 &  1 &  0 & -1 \\
1 & -1 & -1 &  1 &  1 & -1 & -1 &  1 \\
1 & -1 &  0 &  1 & -1 &  0 &  1 & -1 \\
0 & -1 &  1 &  0 &  0 &  1 & -1 &  0 \\
0 & -1 &  1 & -1 &  1 & -1 &  1 &  0
\end{bmatrix},
\]

where $D_2 = \operatorname{diag}\left(1/\sqrt{8},\, 1/\sqrt{6},\, 1/2,\, 1/\sqrt{6},\, 1/\sqrt{8},\, 1/\sqrt{6},\, 1/2,\, 1/\sqrt{6}\right)$.
An efficient factorization for the fast algorithm for $T_2$ was proposed in [38]:
$T_2 = P_2 \cdot A_6 \cdot A_5 \cdot A_1$, where $A_5$ and $A_6$ are sparse matrices detailed
in [38]. Matrix $P_2$ corresponds to the following permutation: (1)(2 5 8)(3 7 6 4).

3.4 Modified CB-2011 Approximation 

The transform proposed in [39] is obtained by replacing elements of the CB-2011 matrix with zeros.
The resulting matrix is given by:

\[
C_3 = D_3 \cdot T_3 = D_3 \cdot
\begin{bmatrix}
1 &  1 &  1 &  1 &  1 &  1 &  1 &  1 \\
1 &  0 &  0 &  0 &  0 &  0 &  0 & -1 \\
1 &  0 &  0 & -1 & -1 &  0 &  0 &  1 \\
0 &  0 & -1 &  0 &  0 &  1 &  0 &  0 \\
1 & -1 & -1 &  1 &  1 & -1 & -1 &  1 \\
0 & -1 &  0 &  0 &  0 &  0 &  1 &  0 \\
0 & -1 &  1 &  0 &  0 &  1 & -1 &  0 \\
0 &  0 &  0 & -1 &  1 &  0 &  0 &  0
\end{bmatrix},
\]

where $D_3 = \operatorname{diag}\left(1/\sqrt{8},\, 1/\sqrt{2},\, 1/2,\, 1/\sqrt{2},\, 1/\sqrt{8},\, 1/\sqrt{2},\, 1/2,\, 1/\sqrt{2}\right)$.
Matrix $T_3$ can be factorized into $T_3 = P_2 \cdot A_6 \cdot A_7 \cdot A_1$, where $A_7$ is a
sparse block-diagonal matrix detailed in [39]. This particular DCT approximation has the
distinction of requiring only 14 additions for its computation [39].
3.5 Approximate DCT in [40]


In [40], a DCT approximation tailored for a particular radio-frequency (RF) application was
obtained in accordance with an exhaustive computational search. This transformation is given by

\[
C_4 = D_4 \cdot T_4,
\]

where the diagonal matrix $D_4$ and the low-complexity matrix $T_4$ are specified in [40]. The fast
algorithm for its computation consists of a matrix factorization that ends in the sparse factors
$A_9 \cdot A_8 \cdot A_1$ and involves a permutation matrix $P_3$; these matrices are detailed
in [40].

4 Proposed Transform 

We aim at deriving a novel low-complexity approximate DCT. To this end, we propose a search
over the 8 x 8 matrix space in order to find candidate matrices that possess low computational cost.
Let us define the cost of a transformation matrix as the number of arithmetic operations required
for its computation. One way to guarantee good candidates is to restrict the search to matrices
whose entries do not require multiplication operations. Thus we have the following optimization
problem:

\[
T^{*} = \arg\min_{T} \operatorname{cost}(T), \tag{1}
\]

where $T^{*}$ is the sought matrix and $\operatorname{cost}(T)$ returns the arithmetic complexity of $T$.
Additionally, the following constraints were adopted:

1. Elements of matrix $T$ must be in {0, ±1, ±2} to ensure that the resulting multiplicative
complexity is null;

2. We impose the following form for matrix $T$:

\[
T =
\begin{bmatrix}
a_3 &  a_3 &  a_3 &  a_3 &  a_3 &  a_3 &  a_3 &  a_3 \\
a_0 &  a_2 &  a_4 &  a_6 & -a_6 & -a_4 & -a_2 & -a_0 \\
a_1 &  a_5 & -a_5 & -a_1 & -a_1 & -a_5 &  a_5 &  a_1 \\
a_2 & -a_6 & -a_0 & -a_4 &  a_4 &  a_0 &  a_6 & -a_2 \\
a_3 & -a_3 & -a_3 &  a_3 &  a_3 & -a_3 & -a_3 &  a_3 \\
a_4 & -a_0 &  a_6 &  a_2 & -a_2 & -a_6 &  a_0 & -a_4 \\
a_5 & -a_1 &  a_1 & -a_5 & -a_5 &  a_1 & -a_1 &  a_5 \\
a_6 & -a_4 &  a_2 & -a_0 &  a_0 & -a_2 &  a_4 & -a_6
\end{bmatrix},
\]

where $a_i \in \{0, 1, 2\}$, for $i = 0, 1, \ldots, 6$;

3. All rows of $T$ are non-null;

4. Matrix $T \cdot T^{\top}$ must be a diagonal matrix to ensure orthogonality of the resulting
approximation [50].

Constraint 2) is required to preserve the DCT-like matrix structure. We recall that the exact
8-point DCT matrix is given by [35]:


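\[
C = \frac{1}{2} \cdot
\begin{bmatrix}
\gamma_3 &  \gamma_3 &  \gamma_3 &  \gamma_3 &  \gamma_3 &  \gamma_3 &  \gamma_3 &  \gamma_3 \\
\gamma_0 &  \gamma_2 &  \gamma_4 &  \gamma_6 & -\gamma_6 & -\gamma_4 & -\gamma_2 & -\gamma_0 \\
\gamma_1 &  \gamma_5 & -\gamma_5 & -\gamma_1 & -\gamma_1 & -\gamma_5 &  \gamma_5 &  \gamma_1 \\
\gamma_2 & -\gamma_6 & -\gamma_0 & -\gamma_4 &  \gamma_4 &  \gamma_0 &  \gamma_6 & -\gamma_2 \\
\gamma_3 & -\gamma_3 & -\gamma_3 &  \gamma_3 &  \gamma_3 & -\gamma_3 & -\gamma_3 &  \gamma_3 \\
\gamma_4 & -\gamma_0 &  \gamma_6 &  \gamma_2 & -\gamma_2 & -\gamma_6 &  \gamma_0 & -\gamma_4 \\
\gamma_5 & -\gamma_1 &  \gamma_1 & -\gamma_5 & -\gamma_5 &  \gamma_1 & -\gamma_1 &  \gamma_5 \\
\gamma_6 & -\gamma_4 &  \gamma_2 & -\gamma_0 &  \gamma_0 & -\gamma_2 &  \gamma_4 & -\gamma_6
\end{bmatrix},
\]

where $\gamma_k = \cos(2\pi(k+1)/32)$, $k = 0, 1, \ldots, 6$.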
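
As a sanity check on the matrix above, the following Python sketch builds it from the seven cosine constants and compares it with the usual closed-form DCT-II entries. It is a minimal illustration that assumes the customary orthonormal scaling (a common factor of 1/2, so that the first row equals 1/√8); the names `GAMMA`, `EXACT_DCT`, and `dct_entry` are hypothetical.

```python
import math

# gamma_k = cos(2*pi*(k+1)/32), k = 0, ..., 6
GAMMA = [math.cos(2 * math.pi * (k + 1) / 32) for k in range(7)]
g = GAMMA

# Exact 8-point DCT matrix written in terms of the gamma constants,
# assuming a common scaling factor of 1/2.
EXACT_DCT = [[0.5 * value for value in row] for row in [
    [g[3],  g[3],  g[3],  g[3],  g[3],  g[3],  g[3],  g[3]],
    [g[0],  g[2],  g[4],  g[6], -g[6], -g[4], -g[2], -g[0]],
    [g[1],  g[5], -g[5], -g[1], -g[1], -g[5],  g[5],  g[1]],
    [g[2], -g[6], -g[0], -g[4],  g[4],  g[0],  g[6], -g[2]],
    [g[3], -g[3], -g[3],  g[3],  g[3], -g[3], -g[3],  g[3]],
    [g[4], -g[0],  g[6],  g[2], -g[2], -g[6],  g[0], -g[4]],
    [g[5], -g[1],  g[1], -g[5], -g[5],  g[1], -g[1],  g[5]],
    [g[6], -g[4],  g[2], -g[0],  g[0], -g[2],  g[4], -g[6]],
]]


def dct_entry(k, n):
    """Closed-form orthonormal DCT-II entry for row k and column n (0-based)."""
    scale = math.sqrt(1 / 8) if k == 0 else 0.5
    return scale * math.cos((2 * n + 1) * k * math.pi / 16)


max_deviation = max(abs(EXACT_DCT[k][n] - dct_entry(k, n))
                    for k in range(8) for n in range(8))

# Largest deviation of C * C^T from the identity matrix.
orthonormality_error = max(
    abs(sum(a * b for a, b in zip(EXACT_DCT[i], EXACT_DCT[j])) - (1.0 if i == j else 0.0))
    for i in range(8) for j in range(8))
```

Both `max_deviation` and `orthonormality_error` should be at the level of floating-point rounding, which confirms that only seven distinct constants are needed to describe the matrix.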

The above optimization problem is algebraically intractable. Therefore, we resorted to an exhaustive
computational search. As a result, eight candidate matrices were found, including the transform
matrix proposed in [39]. Among these minimal-cost matrices, we selected the matrix that presents
the best performance in terms of image quality of compressed images according to the JPEG-like
technique employed in [36-39] and briefly reviewed in Section 5.

An important parameter in the image compression routine is the number of retained coefficients
in the transform domain. In several applications, the number of retained coefficients is very low. For
instance, considering 8 x 8 image blocks, (i) in image compression using support vector machines,
only the first 8-16 coefficients were considered; (ii) Mandyam et al. proposed a method
for image reconstruction based on only three coefficients; and (iii) Bouguezel et al. employed only
10 DCT coefficients when assessing image compression methods [41, 42]. Retaining a very small
number of coefficients is also common for other image block sizes. In high-speed face recognition
applications, Pan et al. demonstrated that just 0.34%-24.26% out of 92x112 DCT coefficients are
sufficient [52]. Therefore, as a compromise, we adopted the number of retained coefficients equal
to 10, as suggested in the experiments by Bouguezel et al. [41, 42].

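The solution of (1) is the following DCT approximation:

\[
C^{*} = D^{*} \cdot T^{*} = D^{*} \cdot
\begin{bmatrix}
1 &  1 &  1 &  1 &  1 &  1 &  1 &  1 \\
0 &  1 &  0 &  0 &  0 &  0 & -1 &  0 \\
1 &  0 &  0 & -1 & -1 &  0 &  0 &  1 \\
1 &  0 &  0 &  0 &  0 &  0 &  0 & -1 \\
1 & -1 & -1 &  1 &  1 & -1 & -1 &  1 \\
0 &  0 &  0 &  1 & -1 &  0 &  0 &  0 \\
0 & -1 &  1 &  0 &  0 &  1 & -1 &  0 \\
0 &  0 &  1 &  0 &  0 & -1 &  0 &  0
\end{bmatrix}.
\]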
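
The orthogonality constraint can be checked directly on the matrix above. The Python sketch below is an illustration rather than part of the original derivation: it forms the product of the low-complexity matrix with its transpose, verifies that the result is diagonal, and computes the diagonal scaling that would make the rows orthonormal. The assumption, labeled here explicitly, is that the diagonal matrix holds the reciprocals of the Euclidean row norms; the variable names are hypothetical.

```python
import math

T_STAR = [
    [1,  1,  1,  1,  1,  1,  1,  1],
    [0,  1,  0,  0,  0,  0, -1,  0],
    [1,  0,  0, -1, -1,  0,  0,  1],
    [1,  0,  0,  0,  0,  0,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [0,  0,  0,  1, -1,  0,  0,  0],
    [0, -1,  1,  0,  0,  1, -1,  0],
    [0,  0,  1,  0,  0, -1,  0,  0],
]


def gram(t):
    """Return T * T^T as a list of lists."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in t] for row_i in t]


G = gram(T_STAR)

# Constraint 4: every off-diagonal entry of T * T^T must vanish.
is_diagonal = all(G[i][j] == 0 for i in range(8) for j in range(8) if i != j)

# Row energies (diagonal of T * T^T) and the scaling assumed to normalize them.
row_energy = [G[i][i] for i in range(8)]
D_STAR = [1 / math.sqrt(energy) for energy in row_energy]

# Scaled matrix C = D * T, whose rows should be orthonormal.
C_STAR = [[d * value for value in row] for d, row in zip(D_STAR, T_STAR)]
unit_norm_error = max(abs(sum(v * v for v in row) - 1.0) for row in C_STAR)
```

Under this assumption the scaling factors are 1/√8, 1/√2, and 1/2 only, so they are of the $1/\sqrt{m}$ form discussed in Section 3 and can be merged into the quantization step.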
5 Computational Complexity and Performance Analysis


The performance of DCT approximations is often a trade-off between accuracy and the
computational complexity of a given algorithm. In this section, we assess the computational
complexity of the discussed methods and objectively compare them. Additionally, we select several
performance measures to quantify how "close" each approximation is to the exact DCT and to
evaluate its performance as an image compression tool.

5.1 Arithmetic Complexity 

We adopt the arithmetic complexity as the figure of merit for estimating the computational
complexity. The arithmetic complexity consists of the number of elementary arithmetic operations
(additions/subtractions, multiplications/divisions, and bit-shift operations) required to compute a
given transformation. In all cases, we focus our attention on the low-complexity
matrices: $T_1$, $T^{(a)}$, $T_2$, $T_3$, $T_4$, and the proposed matrix $T^{*}$. Indeed, in the
context of image and video compression, the complexity of the diagonal matrix can be absorbed into
the quantization step [15, 36-39]; therefore, the diagonal matrix does not contribute towards an
increase of the arithmetic complexity [38, 39].

Because all considered DCT approximations have null multiplicative complexity, we resort
to comparing them in terms of their arithmetic complexity assessed by the number of
additions/subtractions and bit-shift operations. Table 1 displays the obtained complexities. We also
include the complexity of the exact DCT calculated (i) directly from the definition [10] and
(ii) according to the Arai fast algorithm for the exact DCT [33].

We derived a fast algorithm for the proposed transform, employing only 14 additions. This
is the same very low complexity exhibited by the Modified CB-2011 approximation [39]. To the
best of our knowledge, these are the DCT approximations offering the lowest arithmetic complexity
in the literature.
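
The count of 14 additions can be reproduced with a simple three-stage flow graph. The Python sketch below is an illustration derived directly from the rows of the proposed matrix; it is not necessarily the factorization used by the authors, and the names `AdditionCounter` and `fast_t_star` are hypothetical. Every addition or subtraction goes through a counter so that the total can be checked against a direct matrix-vector product.

```python
T_STAR = [
    [1,  1,  1,  1,  1,  1,  1,  1],
    [0,  1,  0,  0,  0,  0, -1,  0],
    [1,  0,  0, -1, -1,  0,  0,  1],
    [1,  0,  0,  0,  0,  0,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [0,  0,  0,  1, -1,  0,  0,  0],
    [0, -1,  1,  0,  0,  1, -1,  0],
    [0,  0,  1,  0,  0, -1,  0,  0],
]


class AdditionCounter:
    """Performs additions/subtractions while counting how many were used."""

    def __init__(self):
        self.count = 0

    def add(self, a, b):
        self.count += 1
        return a + b

    def sub(self, a, b):
        self.count += 1
        return a - b


def fast_t_star(x, ops):
    """Compute y = T* x with additions and subtractions only."""
    # Stage 1: butterflies between mirrored samples (8 operations).
    a0, a1, a2, a3 = (ops.add(x[i], x[7 - i]) for i in range(4))
    d0, d1, d2, d3 = (ops.sub(x[i], x[7 - i]) for i in range(4))
    # Stage 2: even part (4 operations).
    b0 = ops.add(a0, a3)
    b1 = ops.add(a1, a2)
    y2 = ops.sub(a0, a3)
    y6 = ops.sub(a2, a1)
    # Stage 3: outputs 0 and 4 (2 operations).
    y0 = ops.add(b0, b1)
    y4 = ops.sub(b0, b1)
    # The odd outputs are the stage-1 differences, reordered without arithmetic.
    return [y0, d1, y2, d0, y4, d3, y6, d2]


def direct_t_star(x):
    """Reference result computed row by row from the matrix definition."""
    return [sum(t * v for t, v in zip(row, x)) for row in T_STAR]


x = [12, -3, 7, 0, 5, 9, -8, 4]
ops = AdditionCounter()
y_fast = fast_t_star(x, ops)
y_direct = direct_t_star(x)
```

The odd-indexed outputs cost nothing beyond the first butterfly stage, which is where the saving relative to the other approximations in Table 1 comes from.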

5.2 Comparative Performance 

We employed three classes of assessment tools: (i) matrix proximity metrics with respect to the
exact DCT matrix; (ii) transform-related measures; and (iii) image quality measures in image
compression. For the first class of measures, we adopted the total error energy [38] and the
mean-square error (MSE) [11, 13]. For transform performance evaluation, we selected the transform
coding gain (Cg) [11, 13] and the transform efficiency (η) [11, 54]. Finally, for image quality
assessment we employed the peak SNR (PSNR) [55, 56] and the universal quality index (UQI) [57].
The next subsections furnish a brief description of each of these measures.
Table 1: Arithmetic complexity analysis

Method                         Mult   Add   Shifts   Total
Exact DCT (Definition) [10]      64    56        0     120
Arai algorithm (exact) [33]       5    29        0      34
BAS-2008 [36]                     0    18        2      20
BAS-2011 [37] with a = 0          0    16        0      16
BAS-2011 [37] with a = 1          0    18        0      18
BAS-2011 [37] with a = 2          0    18        2      20
CB-2011 [38]                      0    22        0      22
Modified CB-2011 [39]             0    14        0      14
Approximate DCT in [40]           0    24        6      30
Proposed transform                0    14        0      14


5.2.1 Matrix Proximity Metrics 

Let $\hat{C}$ be an approximate DCT matrix and $C$ be the exact DCT matrix. The total error energy
is an energy-based measure for quantifying the "distance" between $C$ and $\hat{C}$. It is described
as follows [38].

Let $H_m(\omega; T)$ be the transfer function of the $m$-th row of a given matrix $T$, as shown below:

\[
H_m(\omega; T) = \sum_{n=1}^{8} t_{m,n} \exp\{-j(n-1)\omega\}, \quad m = 1, 2, \ldots, 8,
\]

where $j = \sqrt{-1}$ and $t_{m,n}$ is the $(m,n)$-th element of $T$. Then the row-wise error
energy related to the difference between $C$ and $\hat{C}$ is furnished by:

\[
D_m(\omega; \hat{C}) \triangleq \left| H_m(\omega; C - \hat{C}) \right|^2, \quad m = 1, 2, \ldots, 8.
\]

We note that, for each row $m$ at any angular frequency $\omega \in [0, \pi]$ in radians per sample,
the expression $D_m(\omega; \hat{C})$ quantifies how discrepant a given approximation matrix
$\hat{C}$ is from the matrix $C$. In this way, a total error energy departing from the exact DCT
can be obtained by [38]:

\[
\epsilon = \sum_{m=1}^{8} \int_{0}^{\pi} D_m(\omega; \hat{C}) \, d\omega.
\]

The above integral can be computed by numerical quadrature methods [58].
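
The total error energy can be evaluated with a few lines of Python. The sketch below is an illustration under stated assumptions: the exact DCT is taken in its orthonormal form, the approximation is the proposed low-complexity matrix with each row divided by its Euclidean norm, and the composite trapezoidal rule is used as the quadrature method. The function names are hypothetical.

```python
import cmath
import math


def exact_dct():
    """Orthonormal 8-point DCT-II matrix."""
    return [[(math.sqrt(1 / 8) if k == 0 else 0.5)
             * math.cos((2 * n + 1) * k * math.pi / 16)
             for n in range(8)] for k in range(8)]


T_STAR = [
    [1,  1,  1,  1,  1,  1,  1,  1],
    [0,  1,  0,  0,  0,  0, -1,  0],
    [1,  0,  0, -1, -1,  0,  0,  1],
    [1,  0,  0,  0,  0,  0,  0, -1],
    [1, -1, -1,  1,  1, -1, -1,  1],
    [0,  0,  0,  1, -1,  0,  0,  0],
    [0, -1,  1,  0,  0,  1, -1,  0],
    [0,  0,  1,  0,  0, -1,  0,  0],
]


def normalize_rows(t):
    """Assumed diagonal scaling: divide every row by its Euclidean norm."""
    return [[v / math.sqrt(sum(u * u for u in row)) for v in row] for row in t]


def row_error_energy(diff_row, omega):
    """|H_m(omega)|^2 for one row of the difference matrix."""
    h = sum(t * cmath.exp(-1j * n * omega) for n, t in enumerate(diff_row))
    return abs(h) ** 2


def total_error_energy(c_exact, c_approx, intervals=512):
    """Sum over rows of the integral of the row error energy on [0, pi]."""
    step = math.pi / intervals
    total = 0.0
    for row_c, row_a in zip(c_exact, c_approx):
        diff = [c - a for c, a in zip(row_c, row_a)]
        samples = [row_error_energy(diff, i * step) for i in range(intervals + 1)]
        # Composite trapezoidal rule.
        total += step * (0.5 * samples[0] + sum(samples[1:-1]) + 0.5 * samples[-1])
    return total


epsilon = total_error_energy(exact_dct(), normalize_rows(T_STAR))
```

Under these assumptions `epsilon` evaluates to about 11.31, in line with the value listed for the proposed transform in Table 2. An exact DCT compared with itself gives zero.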

For the MSE evaluation, we assume that the input signal is a first-order Gaussian Markov
process with zero mean, unit variance, and correlation equal to 0.95. Typically, images satisfy
these requirements. The MSE is mathematically detailed in [11, 13] and should be minimized
to maintain the compatibility between the approximation and the exact DCT outputs.

5.2.2 Transform-related Measures 

The transform coding gain is an important figure of merit to evaluate the coding efficiency of a
given transform as a data compression tool. Its mathematical description can be found in [11, 13].
Another measure of coding performance is the transform efficiency [11, 54]. The
optimal KLT converts signals into completely uncorrelated coefficients and has a transform
efficiency equal to 100, whereas the DCT-II achieves a transform efficiency of 93.9911 for
Markovian data at a correlation coefficient of 0.95.

5.2.3 Image Quality Measures in JPEG-like Compression

For quality analysis, images were submitted to a JPEG-like technique for image compression. The
resulting compressed images are then assessed for image degradation in comparison to the original
input image. Thus, 2-D versions of the discussed methods are required. An 8x8 image block A
has its 2-D transform mathematically expressed by [59]:

\[
T \cdot A \cdot T^{\top}, \tag{2}
\]

where T is a considered transformation. Input images were divided into 8x8 sub-blocks, which were
submitted to the 2-D transforms. For each block, this computation furnished 64 coefficients in the
approximate transform domain for a particular transformation. According to the standard zigzag
sequence [60], only the first r coefficients in each block, with 2 ≤ r ≤ 20, were retained and
employed for image reconstruction [38]. This range of r corresponds to high compression. All the
remaining coefficients were set to zero. The inverse procedure was then applied to reconstruct the
processed image.
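
The block-level procedure described above can be sketched in Python as follows. This is a minimal illustration, not the experimental code: it uses the orthonormal exact DCT as a stand-in transform (an orthonormal approximation could be passed instead), assumes that the inverse is the transpose, and builds the zigzag order by sorting anti-diagonals. All function names and the test block are hypothetical.

```python
import math


def dct_matrix():
    """Orthonormal 8-point DCT-II matrix used here as the transform."""
    return [[(math.sqrt(1 / 8) if k == 0 else 0.5)
             * math.cos((2 * n + 1) * k * math.pi / 16)
             for n in range(8)] for k in range(8)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def zigzag_order(n=8):
    """Zigzag scan: anti-diagonals in order, alternating direction on each one."""
    cells = [(i, j) for i in range(n) for j in range(n)]
    return sorted(cells, key=lambda p: (p[0] + p[1],
                                        p[0] if (p[0] + p[1]) % 2 else -p[0]))


def compress_block(block, c, r):
    """Transform a block, keep the first r zigzag coefficients, and reconstruct."""
    coefficients = matmul(matmul(c, block), transpose(c))
    kept = [[0.0] * 8 for _ in range(8)]
    for i, j in zigzag_order()[:r]:
        kept[i][j] = coefficients[i][j]
    # Inverse of an orthonormal transform: A = C^T * B * C.
    return matmul(matmul(transpose(c), kept), c)


C = dct_matrix()
BLOCK = [[float((3 * i + 5 * j) % 17 * 10) for j in range(8)] for i in range(8)]

full_reconstruction = compress_block(BLOCK, C, 64)
coarse_reconstruction = compress_block(BLOCK, C, 10)
```

With all 64 coefficients retained the block is recovered up to rounding, while `coarse_reconstruction` shows the loss incurred at r = 10.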

Subsequently, recovered images had their PSNR [55] and UQI [57] evaluated. The PSNR is
a standard quality metric in the image processing literature [56], and the UQI is regarded as a
more sophisticated tool for quality assessment, which takes into consideration structural
similarities between images under analysis [38, 57].
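
For reference, the PSNR of an 8-bit image can be computed as in the following Python sketch. It uses the standard definition based on the mean squared error and a peak value of 255; the function name and the tiny test images are hypothetical, and the snippet is an illustration rather than the code used in the experiments.

```python
import math


def psnr(original, processed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized greyscale images."""
    squared_error = 0.0
    count = 0
    for row_o, row_p in zip(original, processed):
        for a, b in zip(row_o, row_p):
            squared_error += (a - b) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


reference = [[10, 20], [30, 40]]
shifted = [[11, 21], [31, 41]]   # every pixel differs by one grey level, so MSE = 1

psnr_shifted = psnr(reference, shifted)
psnr_identical = psnr(reference, reference)
```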

This methodology has been employed in previous works and was extended in [38, 39]. However, in
contrast to the JPEG-like experiments described in [36, 37, 41-43], the extended
experiments considered in [38, 39] adopted the average image quality measure from a collection of
representative images instead of resorting to measurements obtained from single particular images.
This approach is less prone to variance effects and fortuitous input data, and is therefore more
robust. For the above procedure, we considered a set of 45 8-bit greyscale 512 x 512 standard
images obtained from a public image bank [62].
Table 2: Accuracy measures of discussed methods

Method                        ε        MSE (x10^-2)   Cg (dB)   η        PSNR (dB)   UQI
Exact DCT                     0.000    0.000          8.826     93.991   28.336      0.733
BAS-2008 [36]                 5.929    2.378          8.120     86.863   27.245      0.686
BAS-2011 [37] with a = 0      26.864   7.104          7.912     85.642   26.918      0.669
BAS-2011 [37] with a = 1      26.864   7.102          7.913     85.380   26.902      0.668
BAS-2011 [37] with a = 2      27.922   7.832          7.763     84.766   26.299      0.629
CB-2011 [38]                  1.794    0.980          8.184     87.432   27.369      0.697
Modified CB-2011 [39]         8.659    5.939          7.333     80.897   25.224      0.563
Approximate DCT in [40]       0.870    0.621          8.344     88.059   27.567      0.701
Proposed transform            11.313   7.899          7.333     80.897   25.726      0.586


5.3 Performance Results 

Figure 1 presents the resulting average PSNR and average UQI absolute percentage error (APE)
relative to the DCT, for r = 2, 3, ..., 20, i.e., for high compression ratios [38]. The proposed
transform could outperform the Modified CB-2011 approximation for 10 ≤ r ≤ 15, i.e., when
84.38% to 76.56% of the DCT coefficients are discarded. Such high compression ratios are employed
in several applications [41, 53, 63].
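
As a check on these percentages, recall that each 8x8 block has 64 coefficients, so retaining $r$ of them discards a fraction $(64 - r)/64$. For the two endpoints of the interval:

\[
\frac{64 - 10}{64} = \frac{54}{64} = 84.375\%, \qquad
\frac{64 - 15}{64} = \frac{49}{64} = 76.5625\%.
\]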

Table 2 shows the performance measures for the considered transforms. Average PSNR and UQI
measures are presented for all considered images at a selected high compression ratio r = 10. The
approximate transform proposed in [40] could outperform the remaining methods in terms of proximity
measures (total error energy and MSE) when compared to the exact DCT. It also furnished good
image quality measure results (average PSNR = 27.567 dB). However, at the same time, it is the
most expensive approximation measured by its computational cost, as shown in Table 1.

On the other hand, the transforms with the lowest arithmetic complexities are the Modified CB-2011
approximation and the new proposed transform, both requiring only 14 additions. The new transform
could outperform the Modified CB-2011 approximation as an image compression tool, as indicated
by the PSNR and UQI values.

A qualitative comparison based on the resulting compressed image Lena [62] obtained from the
above-described procedure for r = 10 is shown in Fig. 2.

Fig. 1 and Table 2 illustrate the usual trade-off between computational complexity and
performance. For instance, although BAS-2011 (for a = 0) could yield a better PSNR figure when
compared with the proposed algorithm, it is computationally more demanding (about 14.3% more
operations) and its coding gain and transform efficiency are improved by only 7.9% and 5%,
respectively. In contrast, the proposed algorithm requires only 14 additions, which can lead to
smaller, faster, and more energy-efficient circuit designs. In the next section, we offer a
comprehensive hardware analysis and comparison of the discussed algorithms with several
implementation-specific figures of merit.
Figure 1: Image quality measures for several compression ratios: (a) average PSNR; (b) average UQI
absolute percentage error relative to the DCT.

Figure 2: Compressed Lena image using several DCT approximations, including (d) Modified
CB-2011 [39], (e) approximate DCT in [40], and (f) proposed transform. Compression ratio is 84.375%
(r = 10).

Figure 3: RD curves for the 'BasketballPass' test sequence.

We also notice that although proximity to the exact DCT, as measured by the MSE, is a
good characteristic, it is not the defining property of a good DCT approximation, especially in
image compression applications. A vivid example of this seemingly counter-intuitive phenomenon is
the BAS series of DCT approximations. Such approximations possess comparatively large values of
proximity measures (e.g., MSE) when compared with the exact DCT matrix. Nevertheless, they
exhibit very good performance in image compression applications. Results displayed in Table 2
illustrate this behavior.

5.4 Implementation in Real Time Video Compression Software 

The proposed approximate DCT transform was embedded into an open source HEVC standard
reference software [64] in order to assess its performance in real-time video coding. The original
integer transform prescribed in the selected HEVC reference software is a scaled approximation of
the Chen DCT algorithm [65], which employs 26 additions. For comparison, the proposed approximate
DCT requires only 14 additions. Both algorithms were evaluated for their effect on the overall
performance of the encoding process by obtaining rate-distortion (RD) curves for standard video
sequences. The curves were obtained by varying the quantization parameter (QP) from 0 to 50 and
recording, for each QP, the PSNR of the proposed approximate transform with reference to the
Chen DCT implementation already present in the reference software, along with the bits/frame
of the encoded video. The PSNR computation was performed by taking the average PSNR obtained
from the three channels YCbCr of the color image, as suggested in [66, p. 55]. Fig. 3 depicts the
obtained RD curves for the ‘BasketballPass’ test sequence. Fig. 4 shows particular 416×240 frames
for QP ∈ {0, 32, 50} when the proposed approximate DCT and the Chen DCT are considered.
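The channel-averaged PSNR described above can be sketched as follows. This is a minimal illustration, not the code used in the reference software: the function names are hypothetical, 8-bit samples (peak value 255) are assumed, and each frame is assumed to be supplied as three separate planes (Y, Cb, Cr), which may differ in size when the chroma is subsampled.

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized planes (inf if they are identical)."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def average_ycbcr_psnr(ref_planes, test_planes, peak=255.0):
    """Arithmetic mean of the per-channel PSNRs of the (Y, Cb, Cr) planes."""
    per_channel = [psnr(r, t, peak) for r, t in zip(ref_planes, test_planes)]
    return float(np.mean(per_channel))

# Toy example (assumed data): every sample differs by exactly one grey level.
ref = [np.zeros((4, 4)), np.zeros((2, 2)), np.zeros((2, 2))]
tst = [plane + 1.0 for plane in ref]
print(round(average_ycbcr_psnr(ref, tst), 2))
```

Averaging the three per-channel PSNR values in dB is one reading of the procedure; averaging the per-channel MSE values before converting to dB would give a slightly different figure.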

Figure 4: Selected frames from ‘BasketballPass’ test video coded by means of the Chen DCT and
the proposed DCT approximation for QP = 0 (a-b), QP = 32 (c-d), and QP = 50 (e-f). Panels (c)
and (e) show the Chen DCT; panels (d) and (f) show the proposed DCT.

The RD curves reveal that the difference in the rate points of the Chen DCT and the proposed
approximation is negligible. In fact, the mean absolute difference was 0.1234 dB, which is very low.
Moreover, the frames show that the video streams encoded with the two DCT transforms
are almost identical. For QP = 0, 32, and 50, the PSNR values between the resulting frames were
82.51 dB, 42.26 dB, and 36.38 dB, respectively. These very high PSNR values confirm the adequacy
of the proposed scheme.


6 Digital Architectures and Realizations 

In this section, we propose architectures for the detailed 1-D and 2-D approximate 8-point DCT. We
aim at physically implementing (2) for various transformation matrices. The introduced architectures
were submitted to (i) Xilinx FPGA implementations and (ii) CMOS 45 nm application-specific
integrated circuit (ASIC) implementation up to the synthesis level.

This section explores the hardware utilization of the discussed algorithms while providing a
comparison with the proposed novel DCT approximation algorithm and its fast algorithm realization.
Our objective here is to offer digital realizations together with measured or simulated metrics
of hardware resources so that better decisions on the choice of a particular fast algorithm and its
implementation can be reached.


6.1 Proposed Architectures 

We propose digital computer architectures that are custom designed for the real-time implementation
of the fast algorithms described in Section 3. The proposed architectures employ two parallel
realizations of DCT approximation blocks, as shown in Fig. 5.

The 1-D approximate DCT blocks (Fig. 5) implement a particular fast algorithm chosen from
the collection described earlier in the paper. The first instantiation of the DCT block furnishes a
row-wise transform computation of the input image, while the second instantiation furnishes a
column-wise transformation of the intermediate result. The row- and column-wise transforms can
be any of the DCT approximations detailed in the paper. In other words, the row- and
column-wise transforms are not required to be the same. However, for simplicity, we adopted
identical transforms for both steps.

Between the approximate DCT blocks, a real-time row-parallel transposition buffer circuit is
required. This block ensures data ordering for converting the row-transformed data from the first
DCT approximation circuit to a transposed format as required by the column transform circuit.
The transposition buffer block is detailed in Fig. 6.
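The row-column decomposition described above can be checked numerically with a short sketch. The function names are hypothetical, and the matrix `T_placeholder` is a random matrix with entries in {-1, 0, 1} used only as a stand-in; it is not one of the approximations discussed in the paper. The sketch models the data flow only (1-D transform, transposition, 1-D transform), not the timing of the hardware.

```python
import numpy as np

def transform_rows(block, T):
    """Apply the 1-D transform T to every row of an 8x8 block."""
    return block @ T.T

def transform_2d(block, T_row, T_col=None):
    """2-D transform built from two 1-D row transforms and a transposition."""
    T_col = T_row if T_col is None else T_col
    rows_done = transform_rows(block, T_row)        # first 1-D DCT block
    transposed = rows_done.T                        # transposition buffer
    cols_done = transform_rows(transposed, T_col)   # second 1-D DCT block
    # The hardware streams out columns; transpose back to compare with T A T^T.
    return cols_done.T

rng = np.random.default_rng(0)
A = rng.integers(-128, 128, size=(8, 8))            # assumed 8-bit input block
T_placeholder = rng.integers(-1, 2, size=(8, 8))    # stand-in matrix, not from the paper

direct = T_placeholder @ A @ T_placeholder.T
print(np.array_equal(transform_2d(A, T_placeholder), direct))
```

Because the second stage only sees rows, the same 1-D circuit can be reused for both stages, which is why the transposition buffer is the only additional block required.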

Figure 5: Two-dimensional approximate transform by means of 1-D approximate transform. Signal
$x_{k,0}, x_{k,1}, \ldots$ corresponds to the rows of the input image; $X_{k,0}, X_{k,1}, \ldots$ indicates the
transformed rows; $X_{0,j}, X_{1,j}, \ldots$ indicates the columns of the transposed row-wise transformed image;
and the output signal indicates the columns of the final 2-D transformed image. If $i = 0, 1, 2, 3, \ldots$,
then indices $j$ and $k$ satisfy $j = i \pmod 8$ and $k = \lfloor i/8 \rfloor \pmod 8$.

Figure 6: Details of the transposition buffer block.

The digital architectures of the discussed approximate DCT algorithms were given hardware
signal flow diagrams as listed below:

1. Proposed novel algorithm and architecture shown in Fig. 7(a)
2. BAS-2008 architecture shown in Fig. 7(b)
3. BAS-2011 architecture shown in Fig. 7(c)
4. CB-2011 architecture shown in Fig. 7(d)
5. Modified CB-2011 architecture shown in Fig. 7(e)
6. Architecture for the algorithm in [40] shown in Fig. 7(f)


The circuitry sections associated with the constituent matrices of the discussed factorizations are
emphasized in the figures in bold or dashed boxes.


6.2 Xilinx FPGA Implementations 

The discussed methods were physically realized on an FPGA-based rapid prototyping system for various
register sizes and tested using on-chip hardware-in-the-loop co-simulation. The architectures were
designed for digital realization within the MATLAB environment using the Xilinx System Generator
(XSG) with synthesis options set to generic VHDL generation. This was necessary because the
auto-generated register transfer language (RTL) hardware descriptions are targeted at both FPGAs
and custom silicon using standard cell ASIC technology.

The proposed architectures were physically realized on a Xilinx Virtex-6 XC6VSX475T-2ff1156
device. The architectures were realized with fine-grain pipelining for increased throughput. Clocked
registers were inserted at appropriate points within each fast algorithm in order to reduce the critical
path delay as much as possible at a small cost to total area. It is expected that the additional
logic overhead due to fine-grain pipelining is marginal. Realizations were verified on the FPGA chip
using a Xilinx ML605 board at a clock frequency of 100 MHz. Measured results from the FPGA
realization were achieved using stepped hardware-in-the-loop verification.

Several input precision levels were considered in order to investigate the performance in terms
of digital logic resource consumption at varied degrees of numerical accuracy and dynamic range.
Adopting system word length L ∈ {4, 8, 12, 16}, we applied 10,000 random 8-point input test vectors
using hardware co-simulation. The test vectors were generated from within the MATLAB environment
and routed to the physical FPGA device using JTAG [67] based hardware co-simulation.
JTAG is a digital communication standard for programming and debugging reconfigurable devices
such as Xilinx FPGAs.
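A software-side view of this test procedure can be sketched as below. The sketch is illustrative only: the paper generated its vectors in MATLAB, whereas this example uses Python; signed two's-complement inputs are an assumption; and `T_placeholder` is a hypothetical matrix with entries ±1 standing in for an addition-only transform, not one of the matrices from the paper. It shows how the required output word length grows with the input word length L.

```python
import numpy as np

def random_test_vectors(word_length, count=10_000, seed=0):
    """Random signed 8-point vectors whose samples fit in word_length bits."""
    rng = np.random.default_rng(seed)
    low = -(1 << (word_length - 1))
    high = 1 << (word_length - 1)          # exclusive upper bound
    return rng.integers(low, high, size=(count, 8), dtype=np.int64)

def signed_bits(value):
    """Smallest two's-complement width that can hold the integer value."""
    v = int(value)
    magnitude_bits = v.bit_length() if v >= 0 else (-v - 1).bit_length()
    return magnitude_bits + 1

def output_bits_needed(vectors, T):
    """Word length needed to hold every transformed sample without overflow."""
    outputs = vectors @ T.T
    return max(signed_bits(outputs.max()), signed_bits(outputs.min()))

# Hypothetical stand-in matrix with +1/-1 entries (checkerboard sign pattern).
T_placeholder = np.where(np.add.outer(np.arange(8), np.arange(8)) % 2 == 0, 1, -1)

for L in (4, 8, 12, 16):
    vectors = random_test_vectors(L)
    print(L, output_bits_needed(vectors, T_placeholder))
```

For a matrix whose rows contain eight ±1 entries, the output never needs more than L + 4 bits, which bounds the register growth through the adder network.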

The measured data from the FPGA was then routed back to MATLAB memory space. Each
FPGA implementation was evaluated for hardware complexity and real-time performance using
metrics such as configurable logic block (CLB) and flip-flop (FF) count, critical path delay (Tcpd)
in ns, and maximum operating frequency (Fmax) in MHz. The number of available CLBs and FFs
were 297,600 and 595,200, respectively.

Figure 7: Digital architecture for considered DCT approximations. Recoverable panel titles:
(a) proposed approximate transform; (b) BAS-2008 approximate DCT; (c) BAS-2011 approximate
DCT, where m ∈ {−∞, 0, 1}.

Results are reported in Table 3. Quantities were obtained from the Xilinx FPGA synthesis and
place-route tools by accessing the xflow.results report file for each run of the design flow. In
addition, the static (Qp) and dynamic (Dp) power consumptions were estimated using the Xilinx
XPower Analyzer.

From Table 3, it is evident that the proposed transform and the modified CB-2011 approximation
attain the highest maximum operating frequencies for L ≥ 8 and are among the fastest designs for
L = 4. Moreover, these two particular designs achieve the lowest consumption of hardware resources
(CLBs and FFs) when compared with the remaining designs.
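The comparison can be reproduced directly from the tabulated values. The short sketch below copies the L = 16 rows of Table 3 (CLB count, FF count, and Fmax in MHz) and ranks the designs; the dictionary and variable names are illustrative only.

```python
# (CLB, FF, Fmax in MHz) for word length L = 16, copied from Table 3.
table3_L16 = {
    "BAS-2008": (1029, 1915, 284.0),
    "BAS-2011 (a = 0)": (919, 2187, 325.2),
    "BAS-2011 (a = 1)": (1021, 2445, 316.9),
    "BAS-2011 (a = 2)": (1019, 2445, 326.5),
    "CB-2011": (1198, 2162, 256.0),
    "Approximate DCT in [40]": (1291, 2463, 298.0),
    "Modified CB-2011": (834, 1698, 345.5),
    "Proposed": (839, 1697, 341.8),
}

# Rank designs from fastest to slowest and from smallest to largest.
by_speed = sorted(table3_L16, key=lambda name: table3_L16[name][2], reverse=True)
by_clb = sorted(table3_L16, key=lambda name: table3_L16[name][0])
by_ff = sorted(table3_L16, key=lambda name: table3_L16[name][1])

print("Fastest two:   ", by_speed[:2])
print("Fewest CLBs:   ", by_clb[:2])
print("Fewest FFs:    ", by_ff[:2])
```

At this word length the same two designs lead all three rankings, although the margin between them is small (11 CLBs or fewer and under 4 MHz).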

6.3 CMOS 45 nm ASIC Implementation 

The digital architectures were first designed using Xilinx System Generator tools within the
MATLAB/Simulink environment. Thereafter, the corresponding circuits were simulated using bit-true
cycle-accurate models within the MATLAB/Simulink software framework. The architectures were
then converted to corresponding digital hardware description language designs using the
auto-generate feature of the System Generator tool. The resulting hardware description language code
led to physical implementation of the architectures using Xilinx FPGA technology, which in turn
led to extensive hardware co-simulation on the FPGA chip. Hardware co-simulation was used for
verification of the hardware description language designs, which were contained in register transfer
language (RTL) libraries. The verified RTL code for each of the 2-D architectures
was then ported to the Cadence RTL Compiler environment for mapping to application-specific
CMOS technology. To guarantee that the auto-generated RTL could seamlessly compile in the
CMOS design environment, we ensured that the RTL code followed a behavioral description which did
not contain any FPGA-specific (vendor-specific) instructions. By adopting standard IEEE 1164
libraries and behavioral RTL, the resulting code was compatible with Cadence Encounter for CMOS
standard cell synthesis.

For this purpose, we used FreePDK, a free open-source ASIC standard cell library at the 45 nm
node [68]. The supply voltage of the CMOS realization was fixed at VDD = 1.1 V during estimation
of power consumption and logic delay. The adopted figures of merit for the ASIC synthesis were:
area (A) in mm², critical path delay (T) in ns, area-time complexity (AT) in mm² · ns, dynamic
power consumption in watts, and area-time-squared complexity (AT²) in mm² · ns². Results are
displayed in Tables 4 and 5.
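The area-time figures of merit listed above follow directly from the area and the critical path delay. The sketch below computes them for one row of Table 4 (proposed transform, L = 4: A = 0.107 mm², T = 1.110 ns). The function name is hypothetical, and taking the maximum frequency as the reciprocal of the critical path delay is an assumption that agrees with the tabulated ASIC values.

```python
def asic_figures_of_merit(area_mm2, tcpd_ns):
    """Return (AT in mm^2*ns, AT^2 in mm^2*ns^2, Fmax in MHz)."""
    at = area_mm2 * tcpd_ns
    at2 = area_mm2 * tcpd_ns ** 2
    fmax_mhz = 1000.0 / tcpd_ns        # 1/ns equals GHz; scale to MHz
    return at, at2, fmax_mhz

# Proposed transform, L = 4 (values copied from Table 4).
at, at2, fmax = asic_figures_of_merit(0.107, 1.110)
print(round(at, 3), round(at2, 3), round(fmax, 1))   # 0.119 0.132 900.9
```

Because the tabulated areas are rounded to three decimals, recomputing AT or AT² for other rows can differ from the table in the last digit.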

The AT complexity is an adequate metric when the chip area is more relevant than speed or
computational throughput. On the other hand, AT² is employed when real-time speed is the most
important driving force for the optimizations in the logic designs. In all cases, clear improvements in
maximum real-time clock frequency are predicted over the same RTL targeted at FPGA technology.
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Table 3: Hardware Resource Consumption using Xilinx Virtex-6 XC6VSX475T-2ffll56 device 


L     CLB    FF     Qp (W)   Dp (W)   Tcpd (ns)   Fmax (MHz)

BAS-2008 Algorithm
4     395    784    5.154    0.918    2.350       401.7
8     613    1123   5.168    1.105    2.573       367.1
12    821    1523   5.184    1.301    2.930       337.8
16    1029   1915   5.187    1.344    3.254       284.0

BAS-2011 for a = 0
4     335    877    5.142    0.767    2.340       386.4
8     535    1276   5.161    1.015    2.600       356.2
12    728    1732   5.180    1.260    2.822       337.4
16    919    2187   5.198    1.486    2.981       325.2

BAS-2011 for a = 1
4     387    1019   5.146    0.811    2.413       396.7
8     605    1453   5.165    1.065    2.513       361.4
12    813    1949   5.179    1.247    2.962       329.4
16    1021   2445   5.198    1.483    2.987       316.9

BAS-2011 for a = 2
4     385    1019   5.146    0.818    2.371       402.9
8     603    1453   5.163    1.042    2.584       364.7
12    812    1950   5.190    1.378    2.618       353.1
16    1019   2445   5.201    1.527    3.006       326.5

CB-2011 Algorithm
4     452    883    5.141    0.750    2.518       363.4
8     702    1257   5.151    0.876    3.065       303.1
12    950    1709   5.162    1.029    3.466       270.6
16    1198   2162   5.187    1.341    3.610       256.0

Approximate DCT in [40]
4     513    1040   5.158    0.972    2.545       387.8
8     779    1471   5.173    1.170    2.769       351.0
12    1036   1968   5.181    1.262    2.945       314.9
16    1291   2463   5.200    1.514    3.205       298.0

Modified CB-2011 approximation
4     297    652    5.153    0.903    2.384       399.7
8     481    961    5.177    1.214    2.523       391.2
12    657    1329   5.191    1.390    2.693       354.0
16    834    1698   5.219    1.752    2.829       345.5

Proposed Transform
4     303    651    5.146    0.818    2.344       404.0
8     487    963    5.167    1.092    2.470       385.1
12    663    1329   5.185    1.322    2.524       353.7
16    839    1697   5.203    1.551    2.818       341.8
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Table 4: Hardware resource consumption for CMOS 45nm ASIC implementation 


L     ASIC gates   Area (mm²)   Tcpd (ns)   AT (mm²·ns)   AT² (mm²·ns²)   Fmax (MHz)

BAS-2008 Algorithm
4     27792        0.123        1.140       0.140         0.160           877.2
8     44654        0.192        1.204       0.231         0.278           830.6
12    61388        0.262        1.216       0.319         0.388           822.4
16    78281        0.332        1.236       0.411         0.508           809.1

BAS-2011 for a = 0
4     26299        0.114        1.135       0.129         0.147           881.1
8     42313        0.182        1.147       0.209         0.239           871.8
12    58342        0.250        1.225       0.306         0.375           816.3
16    74062        0.317        1.310       0.415         0.544           763.4

BAS-2011 for a = 1
4     25940        0.108        1.106       0.120         0.133           904.2
8     40330        0.166        1.125       0.187         0.210           888.9
12    53728        0.225        1.170       0.263         0.308           854.7
16    67860        0.283        1.200       0.339         0.407           833.3

BAS-2011 for a = 2
4     25554        0.109        1.117       0.122         0.136           895.3
8     39321        0.167        1.132       0.189         0.214           883.4
12    53950        0.226        1.175       0.265         0.312           851.1
16    67979        0.284        1.201       0.341         0.409           832.6

CB-2011 Algorithm
4     30319        0.132        1.167       0.154         0.180           856.9
8     48556        0.209        1.192       0.249         0.296           838.9
12    66956        0.285        1.221       0.348         0.425           819.0
16    85873        0.363        1.240       0.450         0.558           806.5

Approximate DCT in [40]
4     35141        0.151        1.141       0.173         0.197           876.4
8     53624        0.230        1.211       0.278         0.337           825.8
12    73224        0.310        1.234       0.383         0.473           810.4
16    92697        0.391        1.242       0.486         0.603           805.2

Modified CB-2011 Approximation
4     24777        0.107        1.105       0.119         0.131           905.0
8     40746        0.175        1.128       0.197         0.222           886.5
12    56644        0.242        1.164       0.282         0.328           859.1
16    73702        0.314        1.177       0.369         0.434           849.6

Proposed Transform
4     24817        0.107        1.110       0.119         0.132           900.9
8     40705        0.175        1.129       0.197         0.223           885.7
12    56703        0.242        1.165       0.282         0.329           858.4
16    73906        0.314        1.174       0.368         0.432           851.8
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Table 5: Power consumption for CMOS 45nm ASIC implementation 


L     Qp (mW)   Dp (W)

BAS-2008 Algorithm
4     1.00      0.18
8     1.56      0.33
12    2.13      0.43
16    2.70      0.55

BAS-2011 for a = 0
4     0.94      0.21
8     1.48      0.33
12    2.04      0.42
16    2.59      0.50

BAS-2011 for a = 1
4     0.88      0.24
8     1.34      0.36
12    1.81      0.40
16    2.28      0.48

BAS-2011 for a = 2
4     0.89      0.20
8     1.35      0.30
12    1.82      0.39
16    2.29      0.48

CB-2011 Algorithm
4     1.08      0.24
8     1.706     0.36
12    2.32      0.48
16    2.94      0.60

Approximate DCT in [40]
4     1.23      0.28
8     1.87      0.39
12    2.52      0.52
16    3.17      0.64

Modified CB-2011
4     0.88      0.20
8     1.42      0.32
12    1.98      0.43
16    2.55      0.55

Proposed Transform
4     0.88      0.20
8     1.42      0.32
12    1.98      0.43
16    2.55      0.55


7 Conclusion 

In this paper, we proposed (i) a novel low-power 8-point DCT approximation that requires only
14 addition operations and (ii) hardware implementations for the proposed transform
and several other prominent approximate DCT methods, including the designs by
Bouguezel-Ahmad-Swamy. We found that all considered approximate transforms perform very close to
the ideal DCT. However, the modified CB-2011 approximation and the proposed transform possess
lower computational complexity and are faster than all other approximations under consideration.
In terms of image compression, the proposed transform could outperform the modified CB-2011
algorithm. Hence the new proposed transform is the best approximation for the DCT in terms of
computational complexity and speed among the approximate transforms examined.

The introduced implementations address both 1-D and 2-D approximate DCT. All the approximations
were digitally implemented using both Xilinx FPGA tools and CMOS 45 nm ASIC technology.
The speeds of operation were much greater using the CMOS technology for the same word
size. The proposed architectures are therefore suitable for image and video processing, being
candidates for improvements in several standards including HEVC.

Future work includes replacing the FreePDK standard cells with highly optimized proprietary
digital libraries from the TSMC PDK and continuing the CMOS realization all the way up to
chip fabrication and post-fabrication testing on a measurement system. Additionally, we intend to develop
approximate versions of the 4-, 16-, and 32-point DCT as well as of the 4-point discrete sine
transform, which are discrete transforms required by HEVC.
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